About the project

This course is an open data science course using Git and R. My GitHub repository is at https://github.com/KatariinaParnanen/IODS-project


Week 2: Data wrangling, regression analysis and model validation

This week we learned about data wrangling using R and the dplyr package. We were also introduced to basic regression analysis and model validation using diagnostic plots.

Reading in data

We will first read in the data from a local directory. The data comes from a questionnaire given to students of a course, combined with their exam points.

#Read in table
lrn2014<-read.table("/Users/kparnane/Documents/IODS_course/IODS-project/data/learning2014.txt", header=TRUE, sep="\t")

We will explore the dataset’s dimension and structure.

#Explore the dimensions
dim(lrn2014)
## [1] 166   7
#Explore the structure of the file
str(lrn2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...

The data has 166 observations and 7 variable columns.

The dataset has a response variable “points” and explanatory variables “age”, “attitude”, “deep”, “surf” and “stra”. “deep” is the score for deep learning questions, “surf” the score for surface learning questions, and “stra” the score for strategic learning questions; “points” is the exam score. More info on the dataset can be found here.

Data exploration

We will explore correlations and distributions of the explanatory and response variables using “ggpairs”, which plots pairwise correlations and variable distributions.

#Load required packages
library(GGally)
library(ggplot2)

#Draw plot
ggpairs(lrn2014, mapping = aes(col = gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
__Pairwise correlation plot__: Exploration of variables and correlations


#Get summaries of variables
summary(lrn2014)
##  gender       age           attitude         points           deep      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   : 7.00   Min.   :1.583  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:19.00   1st Qu.:3.333  
##          Median :22.00   Median :32.00   Median :23.00   Median :3.667  
##          Mean   :25.51   Mean   :31.43   Mean   :22.72   Mean   :3.680  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:27.75   3rd Qu.:4.083  
##          Max.   :55.00   Max.   :50.00   Max.   :33.00   Max.   :4.917  
##       surf            stra      
##  Min.   :1.583   Min.   :1.250  
##  1st Qu.:2.417   1st Qu.:2.625  
##  Median :2.833   Median :3.188  
##  Mean   :2.787   Mean   :3.121  
##  3rd Qu.:3.167   3rd Qu.:3.625  
##  Max.   :4.333   Max.   :5.000

There are more females than males. Variables other than age seem to roughly follow a normal distribution. The mean age is 25. We are interested in the response variable points and how the explanatory variables correlate with it. The highest positive correlations with points are for “attitude” and “stra”, and “surf” has a negative correlation.

Linear models

Now we fit a linear model with the three variables that the data exploration identified as possibly correlated with exam points.

# fit a linear model
my_model <- lm(points ~ attitude+stra+surf, data = lrn2014)

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The variable “attitude” is positively associated with points (estimate 0.33952, p-value 1.93e-08). The other variables are not significantly associated with exam points, so they can be excluded from the model.

Drop the non-significant variables and fit the model again.

# fit a linear model
my_model2 <- lm(points ~ attitude, data = lrn2014)

#Plot the regression line in scatter plot with exam points versus attitude
qplot(attitude, points, data = lrn2014) + geom_smooth(method = "lm")
__Scatter plot with regression line__: Relationship of points and attitude


Print summary of the model.

# print out a summary of the model
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

There is a significant relationship between attitude and exam points. The attitude variable has a p-value of 4.12e-09, an estimate of 0.35255 and a standard error of 0.05674. The multiple R-squared is 0.1906, meaning that the attitude variable explains roughly 19% of the variation in exam points.
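
R-squared can also be computed directly from the residual and total sums of squares. A minimal sketch on R’s built-in mtcars data (not the course data), just to illustrate the definition:

```r
# R-squared = 1 - RSS/TSS: the share of variance in the response
# that the model explains
fit <- lm(mpg ~ wt, data = mtcars)
rss <- sum(residuals(fit)^2)                   # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
r2  <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)          # TRUE
```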

Diagnostic plots

We will produce diagnostic plots to check whether the model meets the assumptions of linear regression: the errors have constant variance (their size does not depend on the fitted values), the errors are normally distributed, the observations are independent of each other, and the explanatory variables are not strongly correlated with each other.

# Draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(my_model2, which=c(1, 2, 5))
__Diagnostic plots__: Plots for exploring model errors


There don’t seem to be any patterns in the residuals vs fitted values plot, so the size of the residuals does not depend on the fitted values. The residuals in the Q-Q plot follow the theoretical line quite well, so the assumption of normally distributed errors is fulfilled. The residuals vs leverage plot does not reveal any observations with an unusually high influence on the model. The model also has only one explanatory variable, so correlation between explanatory variables is not a problem, and the observations are independent of each other. Thus, we can conclude that the model fits the assumptions.


Week 3: Logistic regression

This week we have learned about logistic regression, piping and model validation.

Reading in data

Read in data and explore.

# Get access to useful libraries
library(dplyr);library(ggplot2);library(tidyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Read in table from local folder
alc<-read.table(file = "data/alc.txt", sep="\t", header=TRUE)

#Explore the data using glimpse

glimpse(alc)
## Observations: 382
## Variables: 35
## $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob       <fct> teacher, other, other, services, other, other, othe...
## $ reason     <fct> course, course, other, home, home, reputation, home...
## $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian   <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
# Draw a bar plot of each variable
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped

The data has 382 observations and 35 variables from students whose alcohol consumption has been studied. The variables are integers, numerics, factors or logicals. More information about the data can be found here: https://archive.ics.uci.edu/ml/datasets/Student+Performance

The data includes variables for sex, absences (absences from school), goout (going out) and age. There are usually differences in how much boys and girls drink due to social norms; absences from school probably correlate with high alcohol use; going out a lot usually means higher consumption because of more exposure to situations with alcohol; and age might have an effect because older kids have easier access to alcohol.

Logistic regression

Logistic regression is a type of generalized linear model where the response variable can take only two values, for example true or false. In this dataset the response variable we are interested in is high alcohol use, which is either true or false.
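
Under the hood the model works on the log-odds scale: the linear predictor is mapped to a probability with the inverse logit. A small base-R sketch of that mapping (plogis() is the inverse logit):

```r
# The inverse logit maps any real-valued linear predictor to (0, 1)
log_odds <- c(-2, 0, 2)
p <- plogis(log_odds)   # same as 1 / (1 + exp(-log_odds))
round(p, 3)             # 0.119 0.500 0.881
```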

Explore the relationship between high alcohol use and the chosen variables, and the distributions of the variables

Explore how sex is related to high alcohol use. Our hypothesis is that males drink more than females.

#Explore the distributions of females and males in the data
g0<-ggplot(data = alc, aes(x=sex))

g0 + geom_bar()

#Explore how high use is dependent on sex using barplots
g1<-ggplot(data = alc, aes(x = high_use, fill = sex))

g1 + geom_bar()+facet_wrap("sex")

There are about equal numbers of males and females in the data. There seem to be more males with high alcohol use than females, which makes sense under the social norms hypothesis.

Explore how absences are related to alcohol use using boxplots. Our hypothesis is that kids who are absent more are more likely to also drink more.

# Explore the distribution of absences
g6<-ggplot(alc, aes(x=absences, fill =sex))
g6+geom_bar() + ylab("count") + ggtitle("Student absences by sex")

#Draw boxplots with high and low alcohol uses relation with absences
g3 <- ggplot(alc, aes(x=high_use, y=absences, col = sex))
g3 + geom_boxplot() +  ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")

Absences are not normally distributed; there is a long tail and the most common value is 0. Kids with high alcohol use seem to have more absences, which makes sense if kids who drink more also skip more school. The pattern looks similar for females and males.

Explore how going out is related to alcohol use. Our hypothesis is that kids who are going out more are exposed to alcohol more and thus will also drink more.

#Explore the distribution of kids going out
g4 <- ggplot(alc, aes(x=goout, fill=sex))
g4 + geom_bar()

#Explore the relationship with high alcohol use
g5 <- ggplot(alc, aes(x=high_use, y=goout, col = sex))
g5 + geom_boxplot()  + ggtitle("Going out by alcohol consumption and sex")

Going out seems to be quite normally distributed between the values 1 and 5. However, there aren’t very many kids who hardly go out at all compared to the ones who go out very often. There are approximately equal numbers of both sexes in each going-out class.

There definitely seems to be a relationship between going out and alcohol use. However, the pattern may differ by sex: among females, the ones who go out often but do not have high alcohol use overlap with those who go out often and do have high alcohol use.

Explore the relationship between age and alcohol use. Our hypothesis is that it is easier for older kids to get alcohol than for younger kids.

# Explore the distribution of age
g6<-ggplot(alc, aes(x=age, fill =sex))
g6+geom_bar()

# Explore the relationship of age  with alcohol use

g7 <- ggplot(alc, aes(y=age, x=high_use, col = sex))
g7 + geom_boxplot() + ylab("age") + ggtitle("Student age by alcohol consumption and sex")

Age is not normally distributed; there is quite a long tail above 19. Ages range from 15 to 22, with 16 being the median. From 18 upwards, there are approximately equal numbers of females and males at each age.

There doesn’t seem to be a very obvious link between alcohol use and age, since the high-use boxplots overlap with the low-use boxplots. It might be that getting alcohol in the kids’ country is not very difficult even for those younger than 18.

Build a model with the variables that seemed interesting

Next we will build a GLM with the binomial family, using the chosen explanatory variables.

# Use generalized linear models to predict high use based on selected variables

m<-glm(high_use~age+sex+absences+goout, data=alc, family = "binomial")

#Print out summary and coefficients
summary(m)
## 
## Call:
## glm(formula = high_use ~ age + sex + absences + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7877  -0.8026  -0.5393   0.8294   2.5201  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.60333    1.84395  -3.039 0.002376 ** 
## age          0.08951    0.11003   0.814 0.415905    
## sexM         0.96338    0.25490   3.779 0.000157 ***
## absences     0.08205    0.02247   3.651 0.000261 ***
## goout        0.71753    0.12058   5.951 2.67e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 387.09  on 377  degrees of freedom
## AIC: 397.09
## 
## Number of Fisher Scoring iterations: 4
#From the summary we can see that sex, absences and going out are significantly correlated with high alcohol use

coef(m)
## (Intercept)         age        sexM    absences       goout 
## -5.60333157  0.08951186  0.96338497  0.08204599  0.71752626
#From the coefficients we can calculate the odds ratios by taking the exponential of the estimates


# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI<- confint(m) %>% exp
## Waiting for profiling to be done...
# Join the two together

cbind(OR, CI)
##                      OR        2.5 %    97.5 %
## (Intercept) 0.003685565 9.268741e-05 0.1296785
## age         1.093640305 8.815211e-01 1.3582243
## sexM        2.620551969 1.599662e+00 4.3546202
## absences    1.085505735 1.040056e+00 1.1371493
## goout       2.049357367 1.626970e+00 2.6130756
  • From the odds ratios we can see that each additional year of age is associated with approximately a 9% increase in the odds of high alcohol use, but the confidence interval includes 1, which means that age is not significant.

  • We can also see that being male multiplies the odds of high alcohol use by about 2.6. Female is the reference level. The CI does not include 1, so sex is also significant.

  • One more absence from school is associated with an approximately 9% increase in the odds of high use; this is also significant, as the CI doesn’t include 1.

  • Going out, on a scale from 1 to 5, is significantly associated with high alcohol use, as the CI doesn’t include 1. An increase of 1 multiplies the odds of high use by approximately 2.0.
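
The interpretation “exp(coefficient) equals the odds ratio per one-unit increase” can be checked numerically. A sketch on the built-in mtcars data (a hypothetical toy model, not the course model):

```r
# Fit a toy logistic model and verify that the ratio of odds at wt = 3
# versus wt = 2 equals exp(coefficient of wt)
m_toy <- glm(am ~ wt, data = mtcars, family = "binomial")
p     <- predict(m_toy, newdata = data.frame(wt = c(2, 3)), type = "response")
odds  <- p / (1 - p)
all.equal(unname(odds[2] / odds[1]), unname(exp(coef(m_toy)["wt"])))  # TRUE
```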

Cross validation

After fitting the model, we will use 10-fold cross-validation to estimate its prediction error. We will use the model without age, since age was not significant.

# Fit the model without age
m<-glm(high_use~sex+absences+goout, data=alc, family = "binomial")


# Predict the probability of high alcohol use
probabilities <- predict(m, type = "response")

# Add the predicted probabilities to the alc table
alc <- mutate(alc, probability = probabilities)

# Use the probabilities to make a TRUE/FALSE prediction of high alcohol use with a 0.5 threshold
alc <- mutate(alc, prediction = probabilities>0.5)

# See the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, absences, sex, high_use, probability, prediction) %>% tail(10)
##     failures absences sex high_use probability prediction
## 373        1        0   M    FALSE  0.14869987      FALSE
## 374        1        7   M     TRUE  0.39514446      FALSE
## 375        0        1   F    FALSE  0.13129452      FALSE
## 376        0        6   F    FALSE  0.18714923      FALSE
## 377        1        2   F    FALSE  0.07342805      FALSE
## 378        0        2   F    FALSE  0.25434555      FALSE
## 379        2        2   F    FALSE  0.07342805      FALSE
## 380        0        3   F    FALSE  0.03989428      FALSE
## 381        0        4   M     TRUE  0.68596604       TRUE
## 382        0        2   M     TRUE  0.09060457      FALSE
# From here you can see that when the probability is > 0.5 the prediction is TRUE

# Cross tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.66230366 0.03926702 0.70157068
##    TRUE  0.17015707 0.12827225 0.29842932
##    Sum   0.83246073 0.16753927 1.00000000
# From here we can see that 65 of the 318 predictions of low alcohol use are wrong and 15 of the 64 predictions of high alcohol use are wrong

# Define a loss function to get the mean prediction error
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2094241
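
To see what the loss function does, a hand-checkable toy call (TRUE/FALSE coerce to 1/0 in the subtraction, so a case counts as wrong when the predicted probability is on the wrong side of 0.5):

```r
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5   # TRUE where the prediction misses
  mean(n_wrong)                        # share of wrong predictions
}
# Only the third case is misclassified (true class TRUE, probability 0.3)
loss_func(class = c(TRUE, FALSE, TRUE), prob = c(0.9, 0.2, 0.3))
# → 0.3333333
```
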
# Do K-fold cross-validation using the loss_func defined previously
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# Average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2198953

There is approximately a 22 % error in the prediction of high alcohol use using the model with sex, absences and going out.


Week 4: Clustering and classification

This week we have learned about clustering, which is a handy tool for visualizing differences between data points. Clustering can be used for example for data exploration. We also learned about classification, which can be used to validate the results of clustering. Discriminant analysis was also touched upon.

Loading in the data and exploring it

We will load the dataset “Boston” from the MASS package. The dataset includes per-capita crime rates for towns in the Boston area, with explanatory variables related to land use and inhabitant demographics.

# Access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(dplyr)

# Load the data
data("Boston")

# Explore the dataset's dimensions and the details of the variables
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Explore the dataset with corrplot

# Load the corrplot package
library(corrplot)
## corrplot 0.84 loaded
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston)%>%round(digits=2) 

# print the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)

The strongest positive correlations with the crime rate are with rad (index of accessibility to radial highways), tax (full-value property-tax rate per $10,000), lstat (percentage of lower-status population) and indus (proportion of non-retail business acres per town). The most negative are with medv (median value of owner-occupied homes in $1000s), black (1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town) and dis (weighted mean of distances to five Boston employment centres). Some of the variables have only a few high or low values, which can produce large correlations that may not be significant.

Scaling the variables

Next we will scale the variables using the scale() function.

# Center and standardize variables
boston_scaled <- scale(Boston)

# Summary of the original, unscaled variables for comparison
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# Class of the boston_scaled object is matrix
class(boston_scaled)
## [1] "matrix"
# Change the matrix to a data frame
boston_scaled<-as.data.frame(boston_scaled)

# Summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# Create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# Create a categorical variable 'crime' using the quantiles as break points
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels=c("low", "med_low", "med_high", "high"))

# Look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# Remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# Add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Scaling centers each column by subtracting its mean (so the values become distances from the mean) and then divides the centered columns by their standard deviations.
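
That scale() really computes (x - mean(x)) / sd(x) column by column can be verified directly; a small check on the crim column of the same Boston data:

```r
library(MASS)
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
auto   <- scale(Boston)[, "crim"]
all.equal(unname(auto), manual)   # TRUE
```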

Create training and test sets

Next we will create training and test sets from the data. We will use 80% of the data for training and then use the rest for testing our model.

# Number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# Choose  80% of the rows randomly
ind <- sample(n,  size = n * 0.8)

# Create train set
train <- boston_scaled[ind,]

# Create test set from the remaining rows
test <- boston_scaled[-ind,]

Linear discriminant analysis

Next we will fit a linear discriminant analysis on the training data using the lda() function with all the explanatory variables.

# Linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2648515 0.2252475 0.2599010 
## 
## Group means:
##                  zn      indus        chas        nox          rm
## low       0.9052054 -0.9270727 -0.07742312 -0.8739586  0.41767359
## med_low  -0.1113640 -0.2785820 -0.05155709 -0.5627487 -0.13027235
## med_high -0.3732153  0.2021220  0.24684899  0.4263448  0.03355228
## high     -0.4872402  1.0149946 -0.04735191  1.0097692 -0.44213757
##                 age        dis        rad        tax     ptratio
## low      -0.8739481  0.8809642 -0.6782665 -0.7457859 -0.44319544
## med_low  -0.3527024  0.3694930 -0.5493177 -0.5071334 -0.06666156
## med_high  0.4258303 -0.4074853 -0.4253064 -0.3242890 -0.38096310
## high      0.8093148 -0.8603141  1.6596029  1.5294129  0.80577843
##               black       lstat        medv
## low       0.3751171 -0.75588422  0.49422203
## med_low   0.3215378 -0.09126862 -0.01586265
## med_high  0.1133669  0.01671381  0.17995945
## high     -0.6969331  0.88680741 -0.65315038
## 
## Coefficients of linear discriminants:
##                  LD1          LD2           LD3
## zn       0.076222036  0.691297244 -1.0101473257
## indus    0.092682523 -0.323048488  0.4199040082
## chas    -0.002013562 -0.032425417 -0.0001601388
## nox      0.240665472 -0.694457777 -1.2945315199
## rm       0.021724921 -0.001133757 -0.0155029037
## age      0.271946348 -0.321856645 -0.2257132590
## dis     -0.123572907 -0.165169041  0.2307296006
## rad      3.707256614  0.819247671 -0.0357350520
## tax     -0.008581646  0.080837112  0.3939823930
## ptratio  0.096747855  0.163437699 -0.2009819492
## black   -0.125312819  0.010536348  0.1432328533
## lstat    0.093274012 -0.143665019  0.5118510527
## medv     0.021324953 -0.319652774 -0.1160306458
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9588 0.0306 0.0106

From the proportion of trace we see that LD1 alone captures about 96% of the between-group variance, so most of the separation between the crime classes lies along the first discriminant. The summary also shows the means of the explanatory variables for the different crime classes: for example, rad and tax differ considerably between the low and high crime rate groups.
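The proportion of trace can also be computed directly from the fitted object; a small sketch, assuming the svd component of the lda fit holds the singular values (as it does for MASS::lda):

```r
# Proportion of between-group variance captured by each discriminant:
# the squared singular values, normalized to sum to one
prop_trace <- lda.fit$svd^2 / sum(lda.fit$svd^2)
round(prop_trace, 4)
```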

Drawing the LDA biplot

Next we will draw a plot to see how the different towns group in the LDA space. We will color and mark the points by crime rate class and draw arrows showing the directions and weights of the explanatory variables.

# The function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# Target classes as numeric
classes <- as.numeric(train$crime)

# Plot the lda results
plot(lda.fit, dimen = 2,col = classes, pch=classes)
lda.arrows(lda.fit, myscale = 2)

From the plot it seems that rad (accessibility to radial highways) has the biggest impact on the grouping. There are two clusters in the plot, one of which contains almost all of the high crime rate towns and some med_high towns, but none of the med_low or low crime rate towns. It seems the model might need only rad to predict the crime rate class.

Test set

Next we will validate the results using the test set we made earlier. We will predict crime rate classes with test data using the lda.fit model we created with the training set.

# Create test set 
test <- boston_scaled[-ind,]

# Save the correct classes from test data
correct_classes <- test$crime

# Remove the crime variable from test data
test <- dplyr::select(test, -crime)

# Predict classes with test data using the lda.fit model we created with the training set
lda.pred <- predict(lda.fit, newdata = test)

# Cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       15       9        2    0
##   med_low    4      11        4    0
##   med_high   0      14       19    2
##   high       0       0        1   21

From the cross tabulation we can see that 21 of the 22 high crime rate towns were predicted correctly; only one was misclassified as med_high. This is pretty good! For the low class, 15 of 26 were classified correctly, while 9 were predicted as med_low and two as med_high.

The model struggles with the correct classification of the middle classes. Of the 35 med_high towns, 19 were classified correctly and 16 were not: 14 were predicted as med_low and two as high.

The model does a bit better with the med_low class: 11 of 19 are correctly classified, with four predicted as low and four as med_high.
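The overall and per-class accuracies can be read off the same cross tabulation; a minimal sketch using the objects defined above:

```r
# Confusion matrix between the true and predicted classes
conf <- table(correct = correct_classes, predicted = lda.pred$class)

# Overall accuracy: correctly classified cases are on the diagonal
sum(diag(conf)) / sum(conf)

# Per-class accuracy: diagonal divided by the row totals
diag(conf) / rowSums(conf)
```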

Predicting the best number of clusters in a dataset

Next we will reload the Boston dataset, scale it, and use the k-means algorithm to find the optimal number of clusters in the data.

# Obtain dataset
data('Boston')

# Scale the data and change the resulting matrix to a data frame
boston_scaled2 <- as.data.frame(scale(Boston))

# K-means clustering
km <-kmeans(boston_scaled2, centers = 3)

# Plot all pairs of the Boston variables, colored by cluster
pairs(boston_scaled2, col = km$cluster)

The variables rad and tax seem to cluster the data pretty nicely into two clusters in the crime vs rad and crime vs tax plots. Three clusters might not be needed, but two might suffice. Let’s check this next.

Determining the optimal number of clusters

We will determine the optimal number of clusters by looking at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes. A good number of clusters is where the total WCSS drops sharply, which can be seen in the plot.

# MASS, ggplot2 and Boston dataset are available
set.seed(123)

# Determine the maximum number of clusters
k_max <- 10

# Calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

# Visualize the results with qplot
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <-kmeans(boston_scaled2, centers = 2)

# Plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)

As we observed in the previous pairs plot, the qplot of the total WCSS suggests that the optimal number of clusters is 2. rad and tax produce the clearest separation between the clusters.


Week 5: Dimensionality reduction techniques

This week we have learned about dimensionality reduction techniques, which are handy tools for visualizing differences between data points in a few dimensions when the actual data is high-dimensional. These methods include PCA (principal component analysis) and CA (correspondence analysis). Results are usually visualized in two dimensions for the sake of interpretability.

Reading in data and graphical representation

First we will load in the human dataset, which contains country-level variables related to economy, education and health.

# Read in table
human<-read.table("data/human.txt", header=TRUE, row.names = 1, sep = "\t")
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
# Access GGally
library(GGally)

# Visualize the variables
ggpairs(human)

# Create summaries of the variables

summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Many of the variables have long tails, so they are probably not normally distributed. Education expectancy seems roughly normally distributed.

Mean life expectancy is 71.7 years and mean education expectancy 13.2 years. There are big differences in GNI between countries: the minimum is 581, the median 12040 and the maximum 123124.

Maternal mortality and the adolescent birth rate are strongly correlated with each other, as are life expectancy and education expectancy.
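The pairwise correlations behind these observations can also be pulled out numerically; a minimal sketch using the human data loaded above:

```r
# Correlation matrix, rounded for readability
cor_m <- round(cor(human), 2)

# The two pairs discussed above
cor_m["Mat.Mor", "Ado.Birth"]
cor_m["Life.Exp", "Edu.Exp"]
```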

Principal component analysis

Next we will project the variables into a two-dimensional space using principal component analysis, which compresses the variation into new dimensions. The dimensions are ranked from one upwards: the first principal component captures the largest amount of variation in the data, the second the second largest, and so on.

Non-standardized data

# Perform principal component analysis on non-standardized data
pca_human <- prcomp(human)

# Create and print out a summary of pca_human
s <- summary(pca_human)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
# Rounded percentages of variance captured by each PC
pca_pr <- round(100 * s$importance[2,], digits = 1)

# Print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
# Create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)") 

# Draw a biplot
biplot(pca_human, cex = c(0.8, 0.1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped


Standardized data

Next we will repeat the analysis for standardized data.

# Standardize the variables
human_std <- scale(human)


# Print out summaries of the standardized variables
summary(human_std)
##     Edu2.FM           Labo.FM           Life.Exp          Edu.Exp       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.3503   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# Perform principal component analysis
pca_human <- prcomp(human_std)


# Create and print out a summary of pca_human
s <- summary(pca_human)
s
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
# Rounded percentages of variance captured by each PC
pca_pr <- round(100 * s$importance[2,], digits = 1)

# print out the percentages of variance
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
# Create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)") 

# Draw a biplot
biplot(pca_human, cex = c(0.8, 0.8), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

In the unscaled data, PC1 explains virtually all of the variance (rounded to 100%) and PC2 none (0%), and PC1 is driven almost entirely by GNI (gross national income). The countries with the highest GNI are on the left and the ones with the lowest GNI on the right.

In the scaled data, PC1 explains 53.6% of the variance. The variables that load on PC1 are education expectancy, gross national income, the education ratio and life expectancy, which are all positively correlated with each other, together with maternal mortality and the adolescent birth rate, which are positively correlated with each other but negatively correlated with the aforementioned variables.

PC2 explains 16.2% of the variance; the female-to-male labour force ratio and the percentage of women in parliament explain the variance in this component.

Scaling made a big difference in the results: without it, GNI drove almost all of the variation simply because the differences in GNI between countries are so huge. After scaling, the effects of the other variables become visible as well.
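The reason scaling matters is visible in the raw column variances; a minimal sketch using the human data loaded above:

```r
# Column variances of the unscaled data: GNI dwarfs the rest, so an
# unscaled PCA is essentially a PCA of GNI alone
sapply(human, var)

# After scaling, every column has unit variance
apply(scale(human), 2, var)
```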

Personal interpretations of the first two principal component dimensions

PC1 separates countries based on GNI, education, life expectancy, maternal mortality and the teenage birth rate. Most of the countries at the extreme with low GNI, low life expectancy etc. are African countries.

At the opposite end of PC1 are many European countries, the USA, Japan, South Korea and the Arab states.

From the scaled figure you can see quite nicely that, for example, the Nordic countries group together: they have high life expectancy, GNI, education expectancy and education ratio, as well as a high female labour rate and a high percentage of women in parliament. There are also countries where the GNI-type variables are high but the female labour rate and the percentage of women in parliament are low, such as Qatar, Kuwait and Saudi Arabia, oil-producing Arab countries; this separates them along the PC2 axis.

Many African countries have high female labour rates and percentages of women in parliament but not especially high education, life expectancy or GNI. The second component divides the African countries into those with low and those with high female labour rates and shares of women in parliament.

Multiple correspondence analysis

Loading in data and data exploration

First we will load the tea dataset and explore how the data looks.

# Load library and tea dataset

library(FactoMineR)
data(tea)

# Explore tea dataset
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255  
##  Not.feminine:171   sophisticated    :215   slimming   : 45  
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##         exciting          relaxing              effect.on.health
##  exciting   :116   No.relaxing:113   effect on health   : 66    
##  No.exciting:184   relaxing   :187   No.effect on health:234    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
## 

There are 300 observations and 36 variables. Most variables are factors, except age, which is an integer.

MCA analysis

Next we will perform multiple correspondence analysis on a subset of the variables.

# Access dplyr

library(dplyr)

# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))

# look at the summaries and structure of the data
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage="quali")

From the plot you can see that people who drink unpackaged tea usually buy it from tea shops, people who buy from chain stores prefer tea bags, and people who use both forms also shop in both types of stores.
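The association seen in the biplot can be checked with a plain cross tabulation; a minimal sketch using the tea_time data defined above:

```r
# Cross tabulate buying habit against place of purchase
tab <- table(tea_time$how, tea_time$where)
tab

# Row proportions make the association easier to read
round(prop.table(tab, margin = 1), 2)
```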